Week 9.1 - The Trajectory of LLM Capabilities

🎯 What We'll Cover

The way researchers talk about “what AI can't do” goes stale faster than almost any other technical claim in the literature. A paper from 2023 documenting that ChatGPT failed at undergraduate mathematics is, by 2026, a historical artefact rather than a current diagnosis. A claim from 2024 that AI can't do scientific writing well no longer matches what frontier models produce today. The artefact under study has changed. The literature about it has not always kept up.

This sub-lesson does two things. First, it lays out the trajectory of LLM capabilities from 2019 to May 2026, so you have a calibrated sense of where the frontier sits right now. Second, it makes the central pedagogical move of the whole week explicit: every claim about AI limitations carries an implicit “as of [date]”. Reading the AI literature critically means asking which model, which date, and whether anyone has retested.

The week's centrepiece reading lives at the end of this sub-lesson: a May 2026 blog post by the Fields medallist Tim Gowers reporting on his experience using ChatGPT 5.5 Pro to extend bounds on an open problem in additive number theory. Read it before moving on to the next sub-lesson.

📈 A Seven-Year Sweep

The pace of capability change over the last seven years is worth pausing on. Below is a deliberately compressed history. Each row marks a moment when the answer to “what can these systems do?” shifted enough that the previous answer became wrong.

Year	Frontier model(s)	What was new	What still failed
2019	GPT-2 (1.5B)	OpenAI initially declined to release weights citing misuse risk; sparked first wave of policy debate.	Output became incoherent past a few paragraphs; no real reasoning; minimal factual reliability.
2020	GPT-3 (175B)	Few-shot in-context learning — the system could do tasks it had never been trained on if given examples in the prompt.	Confidently fabricated facts; couldn't reliably do basic arithmetic; no instruction-following beyond pattern matching.
2022–23	ChatGPT, GPT-4, Claude 2	Instruction-following via RLHF; emergent reasoning on graduate-level tasks; mass adoption.	Hallucinated citations at high rates (Week 5); the “reversal curse”; failed graduate-level mathematics; limited tool use.
2024	GPT-4o, Claude 3.5 Sonnet, Gemini 1.5	First generation with measurable drops in basic-failure rates; multimodal inputs (Week 8); early agentic capability.	Still unreliable on niche-topic citations; sycophancy persisted; long-context performance degraded mid-document.
2025	Claude 4 / Opus 4.6, GPT-5, Gemini 2.5 / Deep Think	IMO 2025 gold-medal performance (35/42, end-to-end natural language, within the 4.5-hour competition window). FrontierMath at ~50%. Frontier coding capability.	Domain-specific brittleness remained; cross-disciplinary verification still required human expertise; calibration still imperfect.
May 2026	Claude Mythos, Opus 4.7, GPT-5.5 Pro, Gemini 3.1 Pro, DeepSeek V4 Pro…	Research-grade contributions in mathematics (Erdős problems) and theoretical physics (gluon and graviton amplitudes). Open-weights frontier (DeepSeek) competitive with closed.	The structural failures — long-tail brittleness, compositional errors, illusions of understanding — persist (see 9.2 and 9.4).

💡 Why This Matters Now

If you arrived in this course with a calibration of AI based on news coverage from 2023, you are in roughly the position of someone reading 2010 reviews of smartphone cameras. Some core observations still apply — cameras have lenses, smartphones have batteries — but the specifics are out of date.

The recalibration matters in two directions. Some things you might think AI can't do, it now can. Some things you might think AI can do reliably, it cannot. Both errors are common in published research right now. This sub-lesson and the next two address them in turn.

🔬 The May 2026 Frontier — Closed-Weights Models

As of mid-May 2026, the closed-weights frontier comprises six families of models, all from large well-resourced labs. Each has slightly different strengths, and the leaderboards shuffle every few weeks. The numbers below are accurate at time of writing and will be wrong within months — that itself is a teaching point.

Anthropic — Claude Mythos (Preview)

Released as a limited preview in May 2026. Currently leads the GPQA Diamond reasoning benchmark at 94.6% and Humanity's Last Exam (HLE) at 64.7%. Notable in security: autonomously identified and exploited a 17-year-old remote-code-execution vulnerability in FreeBSD with no human guidance after the initial request. As of early May 2026, the UK AI Security Institute reported GPT-5.5 reaching rough parity with Mythos on offensive-cyber evaluations (GPT-5.5 71.4% vs Mythos 68.6% on multi-step attack simulations), so Anthropic's earlier unique lead in this domain has narrowed. Available to limited partners via Project Glasswing.

Anthropic — Claude Opus 4.7

Generally available since 16 April 2026. SWE-bench Verified 87.6% (up from 80.8% on Opus 4.6). MCP-Atlas (multi-tool agentic) 77.3%. Coding lift of 13% over Opus 4.6. Vision resolution roughly tripled. The current default for production-grade research workflows.

OpenAI — GPT-5.5 Pro & GPT-5.5

Top of the FrontierMath leaderboard. GPT-5.5 Pro at 52.4%; GPT-5.5 at 51.7%. The GPT-5.4 Pro variant (50%) is the model behind the Erdős problem solutions discussed in 9.3. The GPT-5.2 Pro variant produced the gluon-amplitude result with the Strominger group.

Google DeepMind — Gemini 3.1 Pro / Deep Think

Google DeepMind's Gemini Deep Think achieved a gold medal at IMO 2025: 5/6 problems solved, 35/42 points, end-to-end in natural language within the competition's 4.5-hour window, up from a silver in 2024. Gemini 3.1 Pro is long-context and multimodal, and its Deep Think variant powers the cosmic-string radiation result discussed in 9.3.

Meta — Muse Spark

Released 8 April 2026. Meta's first frontier model since Llama 4, and notably their first not released as open weights. Scores 52 on the Artificial Analysis Intelligence Index. Natively multimodal with text, image and voice input. Includes a parallel-reasoning “Contemplating mode”. Powers Meta AI inside WhatsApp, Instagram, Facebook, Messenger and Ray-Ban smart glasses.

Other major closed labs

xAI's Grok 4, Mistral's Large 3, Cohere's Command R+ generation, and Inflection-acquired models from Microsoft also sit in the broad frontier band. A new OpenAI model (codename “Spud”) is reportedly imminent at time of writing. Track llm-stats.com or benchlm.ai for current state.

🔒 The May 2026 Frontier — Open-Weights Models

A meaningful part of the 2026 story is that open-weights models have closed most of the gap. This wasn't true in 2023–24, when closed labs had a clear lead on every important capability. As of May 2026, an open-weights model leads the world on competitive programming and matches the closed frontier on most agentic and coding benchmarks at substantially lower cost.

DeepSeek V4 Pro

Released 24 April 2026 under the MIT licence. Mixture-of-experts architecture: 1.6 trillion total parameters with 49 billion active per token. 1 million token context window. Codeforces rating 3206 — the highest competitive-programming score of any model, surpassing GPT-5.4's 3168. SWE-bench Verified 80.6%. Roughly 10–13× cheaper per output token than the closed frontier on most agentic benchmarks.

Kimi K2.6 (Moonshot AI)

Released 20 April 2026 by the Beijing-based startup Moonshot. Mixture-of-experts: 1 trillion total parameters with 32 billion active. 256K context window. SWE-bench Verified 85.4%, effectively tying GPT-5.5 (85.1%) at roughly 80% lower cost per token. Modified MIT licence (the display requirement only kicks in above 100M monthly active users or $20M monthly revenue). A second open-weights model now matching closed-frontier coding capability is a notable trajectory data point.

DeepSeek V4 Flash

The efficiency variant: 284B total parameters, 13B active. Designed for high-throughput deployment where Pro's scale is unnecessary.
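The total-versus-active parameter counts quoted for the three mixture-of-experts models above are easier to compare as a ratio. A minimal sketch of that arithmetic, using only the figures given in this section (the dictionary and function names are illustrative):

```python
# Parameter counts in billions (total, active per token), as quoted in this section.
MOE_MODELS = {
    "DeepSeek V4 Pro": (1600, 49),
    "Kimi K2.6": (1000, 32),
    "DeepSeek V4 Flash": (284, 13),
}

def active_fraction(total_b: float, active_b: float) -> float:
    """Share of a mixture-of-experts model's parameters used for each token."""
    return active_b / total_b

for name, (total_b, active_b) in MOE_MODELS.items():
    share = active_fraction(total_b, active_b)
    print(f"{name}: {share:.1%} of parameters active per token")
```

Each of these models routes a token through only about 3 to 5% of its weights, which is one reason per-token compute is far lower than the headline parameter count suggests.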

Qwen 3.6 (Alibaba) & Gemma 4 (Google DeepMind)

Qwen 3.6 27B (April 2026): Apache 2.0, 262K context window, native multimodal input, BF16 weights small enough to fit on a single H100. Top of Artificial Analysis's Intelligence Index for open-weights models under 150B parameters at 46. Hybrid thinking model (toggle chain-of-thought on/off). Reaches frontier on AIME 2026 (~91%). Both Qwen 3.6 and Gemma 4 implement Multi-Token Prediction (MTP) — predicting several future tokens per forward pass for substantial inference speed-ups, an emerging cross-model technique.
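The single-H100 claim follows from simple arithmetic: BF16 stores each parameter in 2 bytes. A rough sketch, assuming the 80 GB H100 variant and counting the weights only (activations and the KV cache need additional memory on top of this):

```python
BYTES_PER_BF16_PARAM = 2  # bfloat16 is a 16-bit format
H100_MEMORY_GB = 80       # assumption: the 80 GB H100 variant

def weight_memory_gb(n_params: float, bytes_per_param: int = BYTES_PER_BF16_PARAM) -> float:
    """Memory needed for the model weights alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9

qwen_gb = weight_memory_gb(27e9)
print(f"Qwen 3.6 27B weights: {qwen_gb:.0f} GB of {H100_MEMORY_GB} GB")
```

The same function shows why the larger models in this section need multi-GPU serving at BF16: the weights alone exceed a single card.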

Other open-weights players

Tencent's Hy3-preview (295B/21B MoE, restricted commercial licence) and xAI's Grok 4.3 (~40–60% cheaper than Grok 4.20 v2) sit just below the frontier. Ant Group's Ling 2.6 1T targets cost-efficiency at $95 per benchmark run but reports a 92% hallucination rate on AA-Omniscience, a useful warning that not all open-weights releases inherit the leading labs' mitigations (see 9.2). The Mistral and Falcon families remain strong on European-deployment grounds. Zyphra ZAYA1-74B (May 2026) uses a non-standard transformer architecture with hybrid attention, an architectural-innovation data point.

⚠️ The Pace of Change

Every model and number above will be wrong within months. By the time you take this course, several of these systems will have been replaced or substantially upgraded; new entrants will have appeared; some leaderboards will have saturated. Treat the figures above as a snapshot and use the llm-stats.com and benchlm.ai leaderboards as living references.

🧮 What These Benchmarks Actually Measure

Benchmark scores are now the standard currency of AI capability claims. They are useful, but only if you know what each one tests. A 90% on AIME and a 50% on FrontierMath sound very different but might describe the same model on tasks of very different difficulty.

Benchmark	What it measures	Current SOTA (May 2026)	Source / notes
FrontierMath	350 expert-level mathematics problems (300 Tier 1–3 + 50 Tier 4 unpublished). Evaluated with Python tools.	GPT-5.5 Pro 52.4%	Epoch AI; problems set by IMO gold medallists and Fields Medal recipients.
GPQA Diamond	Graduate-level physics, biology and chemistry reasoning. Designed to be Google-proof.	Claude Mythos 94.6%	Frontier reasoning benchmark; widely used to discriminate among frontier models.
SWE-bench Verified	Real-world software engineering: GitHub issues whose task descriptions and tests were validated by human annotators.	Opus 4.7 87.6%	Human-validated subset of the original Princeton SWE-bench, released by OpenAI with the SWE-bench authors.
AIME 2026	American Invitational Mathematics Examination — high-school olympiad level.	Qwen3.5-plus 91.3%	Closing in on saturation; useful but no longer discriminating at the top.
MCP-Atlas	Multi-tool agentic workflows using the Model Context Protocol.	Opus 4.7 77.3%	Anthropic; tracks performance on real agentic deployments.
HLE	Humanity's Last Exam: expert-written questions across a broad range of academic subjects.	Claude Mythos 64.7%	Center for AI Safety and Scale AI.
Codeforces (Elo)	Competitive programming rating against human contestants.	DeepSeek V4 Pro 3206	Notable: an open-weights model leads here, ahead of GPT-5.4's 3168.

🔎 Reading Benchmark Scores Critically

Three pitfalls to watch for in announcements:

(1) Benchmark mismatch. A 90% on AIME 2024 does not predict a 90% on FrontierMath. The benchmarks measure different things at different levels of difficulty.

(2) Saturation. When the SOTA on a benchmark exceeds about 90%, the benchmark has stopped discriminating and the remaining gap is largely noise.

(3) Selection. Labs cherry-pick benchmarks favourable to their model. Always check whether the announcement reports performance on benchmarks where competitors lead.
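The saturation pitfall can be made concrete with sampling error. A minimal sketch, assuming benchmark items behave like independent pass/fail trials and using a normal-approximation 95% interval. The 30-question size matches one year of AIME (two 15-question papers); the two scores are made-up inputs for illustration, and overlapping intervals are only a crude check, not a formal significance test.

```python
import math

def score_interval(accuracy: float, n_items: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval for a benchmark accuracy.

    Treats each item as an independent pass/fail trial (a simplification).
    """
    se = math.sqrt(accuracy * (1 - accuracy) / n_items)
    return max(0.0, accuracy - z * se), min(1.0, accuracy + z * se)

def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two intervals share any values."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical scores for two models on a 30-question benchmark.
model_a = score_interval(0.90, 30)
model_b = score_interval(0.93, 30)
print(model_a, model_b, intervals_overlap(model_a, model_b))
```

With 30 items, a three-point difference sits well inside the sampling error. The same difference on a benchmark with thousands of items could be meaningful, which is why the size of the test set matters as much as the headline percentage.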

🚫 What Benchmarks Don't Tell You

Benchmark scores are easy to read and easy to compare. They give the illusion of an objective measure of capability. They are, in fact, a partial and contested measure — and three specific failures of benchmarks are worth knowing before you read any model announcement uncritically.

1. Goodhart's Law: Benchmarks Are Not Real-World Use

📝 Goodhart's Law (Strathern's formulation, 1997)

“When a measure becomes a target, it ceases to be a good measure.”

Charles Goodhart, a British economist, originally articulated the principle in 1975 in the context of monetary policy: any observed statistical regularity tends to collapse once pressure is placed upon it for control purposes. The popular formulation above is from anthropologist Marilyn Strathern in 1997.

The application to AI benchmarks is direct. The benchmarks were originally proxies for capability — if a model scores well on FrontierMath, that's evidence the model can do research-level mathematics. But once benchmarks become the public yardstick for model quality, labs train and tune specifically to perform well on them. The benchmark stops being a proxy for capability and becomes a target the model is optimised to hit. The resulting score is no longer a good measure of the underlying capability the benchmark was meant to track.

A concrete consequence: a model that scores 87.6% on SWE-bench Verified is genuinely good at GitHub-issue-style coding tasks. It may or may not be good at the coding work you need to do. A model that scores 52.4% on FrontierMath is genuinely capable on Fields-medal-curated mathematics problems. It may or may not be useful for the specific mathematical questions in your field. The benchmark measures performance on the benchmark; that's all it can measure.

This is why Sub-Lesson 9.6 asks you to test current frontier models on tasks from your own field, rather than rely on published benchmark scores. The benchmarks tell you what the labs have optimised for. They tell you very little about whether the model will work for what you need.

📝 A Concrete Goodhart Illustration: SWE-Bench → ProgramBench

Frontier models score 87.6% (Opus 4.7) and about 85% (Kimi K2.6, GPT-5.5) on SWE-bench Verified. The benchmark is approaching saturation. If you read those scores as a measure of “the model is now good at software engineering,” you would be making the Goodhart error.

ProgramBench (May 2026) tests something different: given only a program and its documentation, can the model architect and implement a codebase from scratch that matches the reference behaviour? No method signatures to fill in, no class skeletons, no natural-language descriptions of file layout. 200 tasks sourced from real-world open-source GitHub repositories, with more than 248,000 behavioural tests.

The result on the same frontier models that nearly saturate SWE-bench: none of the nine frontier models tested (including Opus 4.7, GPT-5.4 and Gemini 3.1 Pro) fully resolved a single task. Even partial credit was sparse: Opus 4.7 reaches a 95% test-pass rate on only 3% of tasks.

The same models, the same general task family (software engineering), wildly different scores. SWE-bench measures “can the model solve well-scoped patches to existing code given the issue description.” It does not measure “can the model architect and build a real piece of software.” The benchmark you cite determines the capability claim you can make.

Source: ProgramBench: Can Language Models Rebuild Programs From Scratch? (arXiv:2605.03546, May 2026)
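The gap between “fully resolved” and “partial credit” comes down to how per-task test results are aggregated. A minimal sketch of the two aggregations, using invented per-task pass fractions rather than ProgramBench data; the function names and the threshold parameter are illustrative assumptions:

```python
def resolve_rate(pass_fractions: list[float]) -> float:
    """Share of tasks where every behavioural test passed."""
    return sum(f == 1.0 for f in pass_fractions) / len(pass_fractions)

def share_above(pass_fractions: list[float], threshold: float) -> float:
    """Share of tasks whose test-pass fraction meets a threshold."""
    return sum(f >= threshold for f in pass_fractions) / len(pass_fractions)

# Hypothetical results: fraction of tests passed on each of five tasks.
tasks = [0.97, 0.60, 0.42, 0.10, 0.0]

print(resolve_rate(tasks))       # no task passes all of its tests
print(share_above(tasks, 0.95))  # one task in five clears the 95% bar
```

A model can pass most tests on a handful of tasks and still have a resolve rate of zero, so the two numbers answer different questions and should not be quoted interchangeably.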

📝 The Hidden Variable: Harness Engineering

A second Goodhart-adjacent finding has been hardening across 2026: a model's benchmark score conflates two things — the model's underlying capability, and the engineering work done around the model (system prompt, tools, middleware, decomposition strategy, retrieval scaffolding). The community has started calling this latter component the harness, and the harness now often dominates.

The Agentic Harness Engineering paper documents the effect: the same model on Terminal-Bench 2 went from 69.7% to 77.0% over ten iterations of pure harness improvement — beating a human-designed Codex-CLI baseline of 71.9% — while reducing token use on SWE-bench Verified by 12%. Other reports converge on the same observation: 10–20 point swings on tau2-bench depending on prompts and middleware, while the underlying model is identical. Source: Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses, arXiv:2604.25850 (April 2026).

The implication for reading announcements: a 90% score on a benchmark may reflect a 75% model with a great harness, or an 85% model with a mediocre harness. When a lab announces SOTA on benchmark X, you cannot disentangle the harness contribution from the model contribution. For your own work, the corollary is that switching to the latest frontier model is often less impactful than improving the harness around the model you already use.

2. Benchmark Contamination — Training-Data Leakage

A benchmark only measures capability if the model hasn't already seen the answers. The internet contains most published benchmarks. Models trained on web-scale data have, in many cases, seen the benchmark questions and answers during training — producing inflated scores that don't reflect generalised capability.

📝 The MMLU Contamination Evidence

Deng, C., Zhao, Y., Tang, X., Gerstein, M., & Cohan, A. (2023). Investigating Data Contamination in Modern Benchmarks for Large Language Models. arXiv:2311.09783.

The authors used a “Testset Slot Guessing” methodology: mask one wrong answer in an MMLU multiple-choice question and ask the model to fill it in. A model that has never seen the question has little basis for reproducing the missing option word for word, so exact-match rates should be low. ChatGPT achieved 52% exact-match on the masked options; GPT-4 achieved 57%. Both figures are far higher than guessing would explain and are consistent with the models having seen the test set.

A subject-level scan of MMLU found contamination rates as high as 66.7% in some subjects. Variations of the contamination methodology have been documented in dozens of papers since — benchmark contamination is now a recognised systematic problem, not a one-off finding. The Awesome Data Contamination repository tracks the literature.

The implication: a model's reported MMLU score may overstate its general reasoning ability, possibly substantially. Comparable concerns apply to AIME, GSM8K, HumanEval, and most popular benchmarks — all of which have appeared in training corpora.
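The slot-guessing probe described above can be sketched in a few lines. This is an illustrative reconstruction of the idea, not the authors' code: the prompt wording, the `ask_model` callable and the example item are all assumptions, and a real run would call an actual model API in place of the stub.

```python
from typing import Callable

def build_masked_prompt(question: str, options: list[str],
                        answer_idx: int, mask_idx: int) -> str:
    """Hide one wrong option and ask the model to reproduce it."""
    if mask_idx == answer_idx:
        raise ValueError("mask a wrong option, not the correct answer")
    lines = [f"Question: {question}"]
    for i, option in enumerate(options):
        label = "ABCD"[i]  # assumes at most four options
        lines.append(f"{label}. {'[MASK]' if i == mask_idx else option}")
    lines.append(f"Correct answer: {'ABCD'[answer_idx]}")
    lines.append("Fill in [MASK] with the original option text.")
    return "\n".join(lines)

def exact_match_rate(items: list[dict], ask_model: Callable[[str], str]) -> float:
    """Fraction of items where the model reproduces the hidden option verbatim."""
    hits = 0
    for item in items:
        prompt = build_masked_prompt(item["question"], item["options"],
                                     item["answer_idx"], item["mask_idx"])
        guess = ask_model(prompt).strip().lower()
        hits += guess == item["options"][item["mask_idx"]].strip().lower()
    return hits / len(items)

# Hypothetical item and a stub "model" that has not memorised it.
items = [{
    "question": "Which planet is closest to the Sun?",
    "options": ["Venus", "Mercury", "Mars", "Jupiter"],
    "answer_idx": 1,
    "mask_idx": 2,
}]
print(exact_match_rate(items, ask_model=lambda prompt: "Earth"))
```

The signal is the exact-match rate over many items: a plausible wrong option can be guessed occasionally, but reproducing the benchmark's own distractors verbatim at a high rate is hard to explain without exposure to the test set.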

Three responses have emerged:

Held-out / private benchmarks. FrontierMath was deliberately constructed as 350 entirely new and unpublished problems. The Tier 4 set has 20 problems explicitly held out from OpenAI (the funder) so a private set remains for honest evaluation; the leaderboard scores reported in this lesson are evaluated on the private sets. CharXiv (Week 8) used a similar design.
Cleaned and refreshed benchmarks. MMLU-Pro (Wang et al. 2024) increased question difficulty and expanded answer choices to make memorisation less helpful. MMLU-Redux (2025) cleaned ambiguous questions and incorrect answers from the original.
Dynamic benchmarks. Benchmarks regenerated continuously based on training cutoffs, so contamination is structurally impossible. LatestEval (AAAI 2024) is one example.
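The dynamic-benchmark idea reduces to a date filter: only items created after a model's training cutoff are eligible. A minimal sketch with an invented item pool and an invented cutoff date; the class, field and function names are illustrative and are not taken from LatestEval:

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class BenchmarkItem:
    question: str
    published: date  # when the source material first appeared

def post_cutoff_items(items: list[BenchmarkItem], training_cutoff: date) -> list[BenchmarkItem]:
    """Keep only items published strictly after the model's training cutoff."""
    return [item for item in items if item.published > training_cutoff]

# Hypothetical pool and cutoff.
pool = [
    BenchmarkItem("Question drawn from an older source", date(2025, 3, 1)),
    BenchmarkItem("Question drawn from a recent source", date(2026, 4, 20)),
]
eligible = post_cutoff_items(pool, training_cutoff=date(2026, 1, 1))
print([item.question for item in eligible])
```

The filter is only as good as the cutoff date fed into it, so the approach depends on knowing (or conservatively estimating) when a model's training data ends.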

When you read a benchmark score in a model announcement, ask: was this evaluation done on a private/held-out set, or on a public benchmark that was probably in the training data?

3. What Benchmarks Don't Test — Low-Resource Languages and African Contexts

Almost every major benchmark in the table above — FrontierMath, GPQA Diamond, SWE-bench Verified, AIME, MCP-Atlas, HLE — is administered in English. A model that excels on these benchmarks is genuinely good at English-language reasoning on topics well-represented in English-language training data. The leaderboard scores tell you very little about how the model performs on questions in Setswana, on legal reasoning grounded in South African law, on medical questions framed for a Nigerian or Kenyan clinical context, or on mathematics problems written by African educators for African students.

For UCT students, this matters directly: the benchmark scores you see in news coverage may not predict how the same model performs on your research questions, your students' questions, or your local context.

📝 The Measured Gap

The Bridging the Gap project (Alhanai, Kasumovic, Ghassemi, Zitzelberger, Lundin & Chabot-Couture, AAAI 2025; arXiv:2412.12417) translated portions of MMLU and Winogrande into eight low-resource African languages (Amharic, Bambara, Igbo, Sepedi, Shona, Sesotho, Setswana, Tsonga) covering more than 160 million speakers. The paper reports “previously unknown performance gaps” between state-of-the-art LLMs in English and these African languages, and explores fine-tuning, cross-lingual transfer, and cultural-appropriateness adjustments as mitigations — achieving incremental improvements (5.6% from fine-tuning, 2.9% from cross-lingual transfer) that are notable precisely because the underlying gap is so substantial.

Even the “multilingual” benchmarks tend to under-cover Africa. MMLU-ProX covers 29 languages including Wolof, Yoruba, and Zulu, but documents consistent performance degradation from high-resource to low-resource languages. Most African languages aren't covered at all.

Africa-specific benchmarks have begun to emerge in response to the gap. They are useful but not yet widely cited in model announcements:

AfroBench (ACL 2025) — comprehensive evaluation of LLMs on African languages across multiple NLP tasks.
IrokoBench — benchmark designed specifically for African languages in the LLM era.
AfriSpeech-Dialog (NAACL 2025) and Intron Sahara v2 (2026, 57 languages) — African-language ASR benchmarks (covered in Week 8).
Nahabwe et al. (2025), Benchmarking Automatic Speech Recognition Models for African Languages, arXiv:2512.10968 — covered in Week 8.

The pattern: for well-represented African languages (Swahili, Hausa, Yoruba, Afrikaans), frontier models perform reasonably. For under-represented languages (Setswana, Sepedi, isiXhosa, isiZulu in many tasks), performance falls off substantially. Performance differences between languages are larger than performance differences between frontier models.

⚠️ A Practical Implication

If your research touches African languages, African legal systems, African clinical contexts, or African educational systems, do not rely on benchmark scores to predict model usefulness. The benchmarks don't test what you need to know.

Instead, build your own small evaluation: a set of representative questions from your domain, in the languages and contexts you actually work with. The hands-on activities in 9.6 explicitly ask you to do this for your own field. The capability and limitation maps that emerge will be far more informative than any benchmark leaderboard for your purposes.
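A small evaluation of this kind needs very little machinery. The sketch below is one possible shape, assuming you supply your own questions, reference answers and a function that calls whichever model you are testing. The `stub_model` function, the placeholder questions and the exact-match scoring rule are all illustrative assumptions; open-ended answers will usually need human grading instead.

```python
from collections import defaultdict
from typing import Callable

def accuracy_by_group(items: list[dict], ask_model: Callable[[str], str]) -> dict[str, float]:
    """Exact-match accuracy for each group label (for example, language)."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for item in items:
        total[item["group"]] += 1
        answer = ask_model(item["question"]).strip().lower()
        if answer == item["reference"].strip().lower():
            correct[item["group"]] += 1
    return {group: correct[group] / total[group] for group in total}

# Placeholder items: replace with real questions from your own field.
items = [
    {"group": "English", "question": "Q1 (English)", "reference": "a"},
    {"group": "English", "question": "Q2 (English)", "reference": "b"},
    {"group": "Setswana", "question": "Q1 (Setswana)", "reference": "a"},
    {"group": "Setswana", "question": "Q2 (Setswana)", "reference": "b"},
]

def stub_model(question: str) -> str:
    """Stand-in for a real model call; always answers 'a'."""
    return "a"

print(accuracy_by_group(items, stub_model))
```

Reporting accuracy per group rather than as one pooled number is the design choice that matters here: a pooled score would hide exactly the between-language gap this section describes.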

The three failures together

Goodhart's law means benchmarks measure benchmark performance, not real-world capability. Contamination means even that benchmark performance is often inflated by training-data overlap. The African / low-resource gap means the benchmarks are answering questions that are mostly not yours.

None of this means benchmarks are useless. They are useful as a coarse comparison between current frontier models on the specific tasks they measure. They are not useful as a substitute for testing the model on your own work.

📚 Centrepiece Reading: Gowers on ChatGPT 5.5 Pro

Tim Gowers is a Fields medallist and a long-time chronicler of mathematical practice. In May 2026 he posted an account of an experiment in which he used ChatGPT 5.5 Pro to attack open problems from a paper by Mel Nathanson on additive number theory.

📝 What Gowers reports

Over roughly two hours of interaction, ChatGPT 5.5 Pro produced an original construction using h²-dissociated sets that improved the bounds on a function N(h, k) from exponential to polynomial in k. The construction was novel — not retrieved from training data — and Gowers describes it as “quite impressive”. Isaac Rajagopal, the original researcher who had been working on the problem, confirmed that the improvement was “almost certainly correct” at the level of ideas, not merely line-by-line.

Gowers' broader claim is striking: “The lower bound for contributing to mathematics will now be to prove something that LLMs can't prove.”

He pairs this with a defence of human work: “by solving hard problems you get an insight…that you simply don't if all you do is read other people's solutions.”

📖 Required Reading

Gowers, T. (8 May 2026). A recent experience with ChatGPT 5.5 Pro.

Read this before moving to the next sub-lesson. The full post is short, accessible without specialist additive-number-theory background, and lays the groundwork for the trajectory frame that runs through the rest of the week.

🎯 The Move This Sub-Lesson Makes

Date-stamp every claim about AI capability

From here on, when you encounter a claim about what AI can or cannot do — in this course, in published papers, in news articles, anywhere — reflexively check three things. Which model was tested? When was it tested? Has anyone retested with current frontier models?

Without these three pieces of information, an AI capability claim has the same evidentiary status as a 2010 review of smartphone cameras. The category is right; the specifics are stale.
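The three questions can be treated as required fields on any capability claim you record in your notes. A minimal sketch of such a record; the class, the field names, the example test date and the twelve-month staleness threshold are assumptions chosen for illustration, not a rule from this course:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class CapabilityClaim:
    claim: str
    model: Optional[str]         # which model was tested?
    tested_on: Optional[date]    # when was it tested?
    retested_on: Optional[date]  # has anyone retested with a current model?

    def latest_evidence(self) -> Optional[date]:
        """Most recent date on which the claim was actually checked."""
        dates = [d for d in (self.tested_on, self.retested_on) if d is not None]
        return max(dates) if dates else None

    def needs_recheck(self, today: date, max_age_days: int = 365) -> bool:
        """Flag claims with missing provenance or evidence older than the threshold."""
        latest = self.latest_evidence()
        if self.model is None or latest is None:
            return True
        return (today - latest).days > max_age_days

# Hypothetical record; the test date is invented for the example.
old = CapabilityClaim("fails at undergraduate mathematics", "ChatGPT", date(2023, 6, 1), None)
print(old.needs_recheck(today=date(2026, 5, 15)))
```

A claim with no named model or no date is flagged automatically, which mirrors the point above: without provenance, the claim cannot be assessed at all.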

Sub-lesson 9.2 takes this trajectory framing and applies it to specific failure modes. Some have been patched. Some have been reduced. Some are structural and likely permanent. Distinguishing them is the skill we are building.

👉 What Comes Next

Sub-Lesson 9.2 — Three Categories of Failure: Patched, Reduced, Structural. A taxonomy of AI failures that survives the next model release. We use the trajectory frame from this sub-lesson to separate failures that have been fixed from failures that are inherent to how language models work.